QTM 447 Lecture 4: Linear Models and Likelihood

Kevin McAlister

January 23, 2025

Administrative Stuff

Problem Set #1 will be posted by end of day today!

  • Will be due by the start of class on February 4th

Three problems:

  • Showing some properties of L2 regularization when P > N

  • Implementing cyclic coordinate descent as a method for finding the optimal solution for the LASSO

  • Building classifiers for a disease prediction data set

Administrative Stuff

You may work in teams of up to three students

  • Each student must turn in a copy of the solutions on Canvas by the deadline

  • Note your teammates at the top of your solutions

Two files (minimum) for your submission

  • .ipynb or similar file

  • Rendered .html or .pdf of your solutions

  • Please don’t make me start a .ipynb server for your submission!

Administrative Stuff

For proof style questions, you can probably get a lot of the solutions via ChatGPT

  • This is fine if you’re using it to check solutions or trying to kickstart your own work on the problem

Why take this class if you’re just going to let ChatGPT solve the problems?

  • I think the results for the pen-and-paper parts are interesting and will force you to think differently about results related to regularization that you took as a given in your previous ML classes

  • Also, they’re just kinda cool!

Administrative Stuff

For coding exercises, please use ChatGPT/Copilot as much as your little heart desires

  • It’s just the way the world works now!

  • sklearn is annoying in its setup

  • matplotlib is even worse

Predictive Models

In the last couple of classes, we discussed the generic supervised predictive problem

Let’s assume that we have some outcome, y, we would like to predict and that each outcome is generated as a function of some features, \mathbf x.

Specifically: y = f(\mathbf x) + \epsilon

where f(\mathbf x) is some arbitrarily complex function.

Predictive Models

We want to use the N instances of training data that we receive to learn f(\mathbf x) to the best of our ability!

  • Learn in such a way that we can minimize errors for \{\mathbf x_0 , y_0 \} \not \in \{\mathbf X, \mathbf y \}

In other words, we want a generalizable approximation to the true data generating function!

Denote our approximating function as \hat{f}(\mathbf x)

  • Choose \hat{f}(\mathbf x) in such a way that we minimize the risk of any predictions

Predictive Models

The risk of an approximation is computed w.r.t. a loss function - a quantitative measure of how good we expect our predictions to be

Generally, we want to select \hat{f}(\mathbf x) that minimizes the actual expected risk with respect to the data generating distribution:

\hat{f}(\mathbf x) = \underset{f^*(\mathbf x)}{\text{argmin }} E_\mathcal T[L(y , f^*(\mathbf x))]

where f^*(\mathbf x) is a candidate function of the feature vector.

Predictive Models

We have knowledge of:

  • What influences the true expected risk (the size of the training data, the complexity of the approximating function, etc.)

  • How to closely approximate the unknowable true expected risk (train/validation/test splits, K-fold cross validation, etc.) for any candidate function

The goal, then, is to choose the correct function out of a set of candidate functions!

Hypothesis Spaces

The search over all possible functions of \mathbb R^P is generally not possible

  • If we had a small and discrete set of possible feature values and outcomes, then this might be possible

Instead, we’re almost always working with an infinite hypothesis space

  • The total number of points at which we would want to make a prediction is infinite

  • Just think about one continuous feature!

Hypothesis Spaces

Formally, let \mathcal H be the hypothesis space of our search for \hat{f}(\mathbf x)

  • \mathcal H includes all of the possible functions that we’ll examine

  • Choose f^*(\mathbf x) \in \mathcal H that minimizes the generalization error

Typically \mathcal H is restricted to one class of model

  • We can compare \mathcal H and \mathcal H' after the fact

Hypothesis Spaces

The most common \mathcal H is the set of linear models

Linear model

Given a set of P features, a linear model is one where:

\hat{y}_0 = g(\phi(x_0)^T \hat{\boldsymbol \beta})

  • g(\cdot) is some function that maps the input to the output

Thinking in terms of predictions from a training set, a linear model is one where:

\hat{y}_0 = g(\mathbf S_0 \mathbf y)

where \mathbf y is the collection of N training outcomes and \mathbf S_0 is a 1 \times N weighting vector that is a function of \mathbf x_0 and \mathbf X.

Hypothesis Spaces

The most common \mathcal H is the set of linear models

Linear model

For a standard linear model with no feature transformations fit via OLS:

\hat{y}_0 = \mathbf x_0^T \hat{\boldsymbol \beta}

\hat{y}_0 = \mathbf x_0^T (\mathbf X^T \mathbf X)^{-1} \mathbf X^T \mathbf y

\mathbf S_0 = \mathbf x_0^T (\mathbf X^T \mathbf X)^{-1} \mathbf X^T
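A quick numerical sanity check of this identity (synthetic data; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))        # N = 50 training instances, P = 3 features
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)
x0 = rng.normal(size=3)             # a new instance to predict

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # OLS coefficients
S0 = x0 @ np.linalg.inv(X.T @ X) @ X.T         # the 1 x N weighting vector

# The two routes to the prediction agree: x0' beta_hat = S0 y
assert np.isclose(x0 @ beta_hat, S0 @ y)
```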

Hypothesis Spaces

We need to be really clear that linear models do not only include linear prediction surfaces!

Feature extraction

Let \mathbf x be a vector of features. Then, define the feature extractor as a function of the features:

\phi(\mathbf x)

With clever choice of feature extractor, the linear model can be used to induce really complex functions!

  • Additionally, as long as \phi(\mathbf x) has a fixed set of parameters, then we can still compute any coefficients using linear methods

Hypothesis Spaces

For a single feature model, we can define a feature extractor as the polynomial expansion of degree d

\phi(\mathbf x) = [1,x,x^2,...,x^d]

which induces the standard polynomial regression model.
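For a single feature, this extractor is easy to build directly (a sketch using numpy; `poly_features` is a hypothetical helper name):

```python
import numpy as np

def poly_features(x, d):
    """Polynomial feature extractor phi(x) = [1, x, x^2, ..., x^d] for a single feature."""
    return np.vander(x, N=d + 1, increasing=True)

x = np.array([0.0, 1.0, 2.0])
Phi = poly_features(x, d=3)
# Each row is [1, x, x^2, x^3] for one instance
assert Phi.shape == (3, 4)
assert np.allclose(Phi[2], [1.0, 2.0, 4.0, 8.0])
```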

For multifeature models, polynomial expansions tend to grow in size exponentially with the degree of the polynomial

  • Need to consider both expansions for each term and interactions between different features

  • Fully saturated models have {P + d} \choose {d} terms!

Get around this with clever kernel functions like the polynomial kernel or RBF kernel

Hypothesis Spaces

With good choices for kernel functions, we can represent a large and complicated feature extractor with a N \times N matrix

Polynomial Kernel:

K(\mathbf x , \mathbf z) = (\mathbf x^T \mathbf z + 1)^d

corresponds, for a second degree kernel, to a feature expansion for each instance of:

[1, \sqrt{2} x_1, \sqrt{2} x_2, x_1^2, x_2^2, \sqrt{2} x_1 x_2]
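We can check the kernel trick numerically: the kernel value equals an inner product of explicit degree-2 feature maps with the appropriate \sqrt{2} scalings (names are illustrative):

```python
import numpy as np

def poly_kernel(x, z, d=2):
    """Polynomial kernel K(x, z) = (x'z + 1)^d."""
    return (x @ z + 1.0) ** d

def phi2(x):
    """Explicit degree-2 feature map for two features, with sqrt(2) scalings."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1**2, x2**2,
                     np.sqrt(2) * x1 * x2])

x = np.array([0.5, -1.0])
z = np.array([2.0, 3.0])
# The kernel computes the inner product in the expanded space without building it
assert np.isclose(poly_kernel(x, z), phi2(x) @ phi2(z))
```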

Hypothesis Spaces

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

# Train logistic regression on a degree-15 polynomial feature expansion
model = make_pipeline(PolynomialFeatures(degree=15), LogisticRegression())
model.fit(X_train, y_train)

# Create a grid to evaluate model
x_min, x_max = X_train[:, 0].min() - 1, X_train[:, 0].max() + 1
y_min, y_max = X_train[:, 1].min() - 1, X_train[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02),
                     np.arange(y_min, y_max, 0.02))

# Predict on the grid
Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

# Plot the contour and training examples
plt.contourf(xx, yy, Z, alpha=0.8)
#plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap='viridis')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('15 degree polynomial kernel w/ Logistic Regression')
plt.show()

Hypothesis Spaces

Hypothesis Spaces

RBF Kernel:

K(\mathbf x , \mathbf z) = \exp \left[ - \frac{\| \mathbf x - \mathbf z \|^2}{2\sigma^2} \right]

corresponds to a feature expansion for each instance that includes all polynomials and interactions.

  • With small enough \sigma^2 guaranteed to yield zero training error!

Hypothesis Spaces

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.kernel_approximation import RBFSampler
from sklearn.pipeline import make_pipeline

# Train an (approximate) RBF kernel logistic regression
# RBFSampler's gamma is 1/(2*sigma^2), so gamma = 0.5 corresponds to sigma^2 = 1
model = make_pipeline(RBFSampler(gamma=0.5, random_state=42), LogisticRegression())
model.fit(X_train, y_train)

# Create a grid to evaluate model
x_min, x_max = X_train[:, 0].min() - 1, X_train[:, 0].max() + 1
y_min, y_max = X_train[:, 1].min() - 1, X_train[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02),
                     np.arange(y_min, y_max, 0.02))

# Predict on the grid
Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

# Plot the contour and training examples
plt.contourf(xx, yy, Z, alpha=0.8)
#plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap='viridis')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('RBF Kernel Logistic Regression (sig2 = 1)')
plt.show()

Hypothesis Spaces

Note that the work of making the linear model wiggly relies on choosing an appropriate feature extractor for the problem

For now, we’ll assume that the feature extractor is fixed and there is nothing related to it to learn

  • Neural networks are all about learning a complicated feature extractor

  • More to come soon.

Hypothesis Spaces

The linear model:

\hat{y} = g(\phi(\mathbf x)^T \boldsymbol \beta)

is a hypothesis space over all possible linear models that can be specified by different values of \boldsymbol \beta.

  • \boldsymbol \beta \in \mathbb R^P means that we have \infty^P possible models!

Hypothesis Spaces

We want to choose f^*(\mathbf x) \in \mathcal H that performs well for unseen data.

For the linear model, this amounts to choosing \hat{\boldsymbol \beta} \in \mathbb R^P that minimizes the generalization error.

An approach:

  • Collect N training data instances
  • Select \hat{\boldsymbol \beta} that minimizes the generalization error \hat{\boldsymbol \beta} = \underset{\boldsymbol \beta^*}{\text{argmin }} E_{\mathcal T}[L(y , g(\phi(\mathbf x)^T \boldsymbol \beta^*))]

Empirical Risk Minimization

Our expectation:

E_{\mathcal T}[L(y , g(\phi(\mathbf x)^T \boldsymbol \beta^*))]

is a long run average. Given N samples from \mathcal T, we can approximate this expectation with the empirical risk:

\frac{1}{N} \sum \limits_{i = 1}^N L(y_i , g(\phi(\mathbf x_i)^T \boldsymbol \beta^*)) \overset{p}{\longrightarrow}E_{\mathcal T}[L(y , g(\phi(\mathbf x)^T \boldsymbol \beta^*))]

Therefore, it suffices to select coefficients that minimize the empirical risk or in-sample error.

\hat{\boldsymbol \beta} = \underset{\boldsymbol \beta^*}{\text{argmin }} \frac{1}{N} \sum \limits_{i = 1}^N L(y_i , g(\phi(\mathbf x_i)^T \boldsymbol \beta^*))

Empirical Risk Minimization

However, we know that the in-sample error tends to underestimate the generalization error when N is finite.

To deal with this asymmetry, it is common to add a penalty or regularizer term that accounts for the gap between the in-sample error and the generalization error:

\hat{\boldsymbol \beta} = \underset{\boldsymbol \beta^*}{\text{argmin }} \frac{1}{N} \sum \limits_{i = 1}^N L(y_i , g(\phi(\mathbf x_i)^T \boldsymbol \beta^*)) + \lambda C(\boldsymbol \beta^*)

where C(\cdot) is a measure of the model complexity and \lambda is a complexity knob that controls the penalty on complexity as a function of N (or just chosen using CV or other methods).

Empirical Risk Minimization

This definition is quite broad!

  • Any model, not just linear models, can be fit using the ERM paradigm.

Defines a two-step training procedure:

  • Find the ERM for the model given any tuning parameters

  • Choose the tuning parameter to minimize the generalization error

Linear Models

Commonly seen linear models in ML are broken into three categories:

  • Continuous outcome: \phi(\mathbf x)^T \boldsymbol \beta \in \mathbb R

  • Binary outcome: g(\phi(\mathbf x)^T \boldsymbol \beta) \in (0,1)

  • Unordered categorical outcomes: g(\phi(\mathbf x)^T \boldsymbol \beta_k) \in \{1,2,...,K\}

There are other types of outcomes, but these are the most commonly seen

Linear Models

In each case, the coefficients are found using ERM with respect to a loss function.

Given training data, \mathbf X and \mathbf y, find the coefficients that minimize the loss:

Linear Regression

For a continuous outcome, find \hat{\boldsymbol \beta} that minimizes the mean squared error loss:

\hat{\boldsymbol \beta} = \underset{\boldsymbol \beta^*}{\text{argmin }} \frac{1}{N} \sum \limits_{i = 1}^N (y_i - \phi(\mathbf x_i)^T \boldsymbol \beta^*)^2

Linear Models

Binary Logistic Regression

For a binary outcome, y_i \in \{0,1\}, find \hat{\boldsymbol \beta} that minimizes the probability loss:

\hat{\boldsymbol \beta} = \underset{\boldsymbol \beta^*}{\text{argmin }} \frac{1}{N} \sum \limits_{i = 1}^N 1 - I(y_i = 1)\sigma(\phi(\mathbf x_i)^T \boldsymbol \beta^*) - I(y_i = 0)(1 - \sigma(\phi(\mathbf x_i)^T \boldsymbol \beta^*))

where \sigma(z) is the binary logistic function:

\sigma(z) = \frac{\exp[z]}{1 + \exp[z]}

and equates to the probability that y_i = 1 given our model choice and \mathbf x.

Linear Models

Multinomial Logistic Regression

For a categorical outcome, y_i \in \{1,2,...,K\}, find \hat{\boldsymbol \beta}_k \; \forall k \in \{1,...,K\} that minimizes the probability loss:

\hat{\boldsymbol \beta} = \underset{\boldsymbol \beta^*}{\text{argmin }} \frac{1}{N} \sum \limits_{i = 1}^N \sum \limits_{k = 1}^K I(y_i \neq k) \sigma_k(\phi(\mathbf x_i)^T \boldsymbol \beta_k)

where there are K P-vectors of coefficients and \sigma_k(\mathbf z) is the softmax function for category k:

\sigma_k(\mathbf z) = \frac{\exp[z_k]}{\sum \limits_{h = 1}^K \exp[z_h]}

and equates to the probability that y_i = k given our model choice and \mathbf x.
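The softmax above can be implemented directly (subtracting the max score is a standard numerical-stability trick, not part of the definition):

```python
import numpy as np

def softmax(z):
    """Softmax over a vector of scores z; subtracting max(z) avoids overflow."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
assert np.isclose(p.sum(), 1.0)      # a valid probability vector
assert p.argmax() == 0               # the largest score gets the largest probability
```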

Linear Models

Each of these are common linear models

  • Functions of a linear combination of coefficients and features

Though it may appear that these are all very different methods, they all actually fall under the umbrella of the generalized linear model framework

  • Though MNL is kinda not…

We’ll discuss this briefly, but the major implication is that each of these models corresponds to a maximum likelihood estimate from a distribution in the exponential family

  • In fewer words: there is a common ERM strategy that can be expressed via chained equations

Maximum Likelihood Estimation

We’ll start with a familiar setting:

Suppose we have N i.i.d. draws of training data that are drawn from \mathcal T. \mathcal T is a giant joint distribution.

The discriminative part:

Instead of worrying about the giant joint distribution, let's just think about the conditional distribution of y given some feature values:

y \mid \mathbf x = f(\mathbf x) + \epsilon

Maximum Likelihood Estimation

Rather than leaving this in its arbitrary form, perhaps we’re willing to assume that Pr(y | \mathbf x) follows some known distributional form that is parameterized by a set of values, \theta.

Example: The Normal Distribution and Linear Regression

We assume that each y_i is generated from

y_i = \mathbf x_i^T \boldsymbol \beta + \epsilon_i

where \epsilon_i is an i.i.d. draw from a normal noise distribution with mean 0 and variance \sigma^2.

Maximum Likelihood Estimation

Using math of normal distributions, we can say that each y_i is a draw from a normal distribution with mean \mathbf x^T \boldsymbol \beta and variance \sigma^2.

Pr(y_i | \mathbf x_i , \boldsymbol \beta, \sigma^2) = \mathcal N(y_i | \mathbf x_i^T \boldsymbol \beta , \sigma^2)

\mathcal N(y_i | \mathbf x_i^T \boldsymbol \beta , \sigma^2) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp \left[-\frac{1}{2\sigma^2} (y_i - \mathbf x_i^T \boldsymbol \beta)^2 \right]

Given values of \mathbf x_i, y_i, \boldsymbol \beta, and \sigma^2, we could compute the probability/density with which we would expect to see y_i generated from this distribution!

Maximum Likelihood Estimation

Over a collection of feature/outcome pairs, we can easily compute the joint density with which we would expect to see all of the values if we assume that each of the draws are i.i.d. from a common data generating distribution

Pr(\mathbf y | \mathbf X , \boldsymbol \beta, \sigma^2) = \prod \limits_{i = 1}^N \mathcal N(y_i | \mathbf x_i^T \boldsymbol \beta , \sigma^2)

  • The joint probability of independent events is the product of the marginal probabilities.
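The product-to-sum structure of the joint density is easy to check numerically (synthetic data; parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
beta = np.array([2.0, -1.0])
sigma2 = 0.25
X = rng.normal(size=(20, 2))
y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=20)

# Density of each y_i under N(x_i' beta, sigma^2)
dens = (1 / np.sqrt(2 * np.pi * sigma2)) * np.exp(-(y - X @ beta) ** 2 / (2 * sigma2))

# The log of the product of i.i.d. densities is the sum of the log densities
assert np.isclose(np.log(np.prod(dens)), np.sum(np.log(dens)))
```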

Maximum Likelihood Estimation

We could compute this probability if we knew the parameters

  • But we don’t…

However, it makes sense that we could use this specification to choose optimal values of the parameters!

  • Choose values for \boldsymbol \beta and \sigma^2 that maximize the likelihood with which we would expect to see the values of the outcome given the features!

Maximum Likelihood Estimation

A few notes:

  • The joint likelihood, \ell(\boldsymbol \Theta | \mathbf X, \mathbf y) = Pr(\mathbf y | \mathbf X , \boldsymbol \Theta), ranges between 0 and 1 for discrete outcomes
  • We want to choose values for the unknowns that maximize this value
  • The log likelihood, \ell \ell(\boldsymbol \Theta | \mathbf X , \mathbf y) = \log Pr(\mathbf y | \mathbf X , \boldsymbol \Theta), preserves any local maxima structure and ranges between -\infty and 0.
  • If the joint likelihood is a product, then the log is a sum of logs.
  • The negative log-likelihood, -\ell \ell(\boldsymbol \Theta | \mathbf X , \mathbf y) = -\log Pr(\mathbf y | \mathbf X , \boldsymbol \Theta), ranges from 0 to \infty

Maximum Likelihood Estimation

The negative log-likelihood ranges from 0 to \infty and only achieves zero if all outcomes are perfectly explained with no variation by the features and the parameters

This means that the negative log-likelihood is a proper loss function and can be used to choose an ERM!

  • The NLL is a proper scoring rule. See 14.2 in PML2.

  • Minimize the NLL to get the ERM.

Multiple Linear Regression

Let’s show the MLE for \boldsymbol \beta for a linear regression model.

\hat{\boldsymbol \beta} = \underset{\boldsymbol \beta^*}{\text{argmax }}\ell(\boldsymbol \beta^* | \mathbf X , \mathbf y , \sigma^2)

Define the joint likelihood:

\ell(\boldsymbol \beta^* | \mathbf X , \mathbf y , \sigma^2) = \prod \limits_{i = 1}^N \frac{1}{\sqrt{2 \pi \sigma^2}} \exp \left[-\frac{1}{2\sigma^2} (y_i - \mathbf x_i^T \boldsymbol \beta^*)^2 \right]

Compute the log likelihood:

\ell\ell(\boldsymbol \beta^* | \mathbf X , \mathbf y , \sigma^2) = \sum \limits_{i = 1}^N - \log \sqrt{2 \pi \sigma^2} -\frac{1}{2\sigma^2} (y_i - \mathbf x_i^T \boldsymbol \beta^*)^2

Multiple Linear Regression

Note that this is a sum of inner products. We can reexpress this as:

\ell\ell(\boldsymbol \beta^* | \mathbf X , \mathbf y , \sigma^2) = - N \log \sqrt{2 \pi \sigma^2} - \frac{1}{2\sigma^2} (\mathbf y - \mathbf X \boldsymbol \beta^*)^T(\mathbf y - \mathbf X \boldsymbol \beta^*)

Taking the derivative w.r.t. \boldsymbol \beta^* and setting to \mathbf 0:

-\frac{1}{2\sigma^2}\left[-2\mathbf X^T \mathbf y + 2 \mathbf X^T \mathbf X \boldsymbol \beta^*\right] = \mathbf 0

Solving:

\hat{\boldsymbol \beta} = (\mathbf X^T \mathbf X)^{-1} \mathbf X^T \mathbf y
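A quick check that the closed form agrees with a library least-squares solver and that the score (the gradient of the log likelihood) vanishes at the solution (synthetic data):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 4))
y = X @ np.array([1.0, 0.0, -3.0, 2.0]) + rng.normal(size=100)

beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y        # the closed-form MLE
beta_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]  # library least squares

assert np.allclose(beta_hat, beta_lstsq)
# The score X'(y - X beta) vanishes at the MLE
assert np.allclose(X.T @ (y - X @ beta_hat), 0.0)
```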

Multiple Linear Regression

Two interesting points:

  • The MLE for the coefficients is the same as minimizing the mean squared error

  • The negative log likelihood is just MSE up to additive and multiplicative constants!

-\ell\ell(\boldsymbol \beta^* | \mathbf X , \mathbf y , \sigma^2) = C_1 + C_2 \sum \limits_{i = 1}^N (y_i - \mathbf x_i^T \boldsymbol \beta^*)^2

where C_1 = N \log \sqrt{2 \pi \sigma^2} and C_2 = \frac{1}{2\sigma^2} do not depend on \boldsymbol \beta^*.

Maximizing the likelihood of the Gaussian regression model is equivalent to minimizing the MSE.

Multiple Linear Regression

A final point:

We can express the linear regression model as a chain of equations.

-\log Pr(\mathbf y | \mathbf X, \boldsymbol \beta , \sigma^2) = -\sum \log \mathcal N(y_i | z_i , \sigma^2)

z_i = \mathbf x_i^T \boldsymbol \beta

This is useful because of the derivative chain rule:

\frac{\partial -\log Pr(y | \mathbf x, \boldsymbol \beta , \sigma^2)}{\partial \boldsymbol \beta} = \frac{\partial -\log Pr(y | \mathbf x, \boldsymbol \beta , \sigma^2)}{\partial z} \frac{\partial z}{\partial \boldsymbol \beta}

  • Not needed here, but soon.

Binary Logistic Regression

Can we do the same thing for binary logistic regression?

To make logistic regression work, we assume that each 0/1 outcome is an i.i.d. draw from a Bernoulli distribution:

Pr(y_i | \mathbf x_i , \boldsymbol \beta) = \theta_i^{y_i} (1 - \theta_i)^{1 - y_i}

where

\theta_i = \frac{\exp[z_i]}{1 + \exp[z_i]}

and

z_i = \mathbf x_i^T \boldsymbol \beta

Binary Logistic Regression

Pr(y_i | \mathbf x_i , \boldsymbol \beta) = \theta_i^{y_i} (1 - \theta_i)^{1 - y_i}

  • \theta_i corresponds to the probability that y_i = 1 given the features we’ve observed and our choice of model

Take a minute:

Can you derive the negative log likelihood for the Bernoulli distribution in terms of \theta?

\ell(\boldsymbol \beta | \mathbf X , \mathbf y) = \prod \limits_{i = 1}^N \theta_i^{y_i} (1 - \theta_i)^{1- y_i}

Binary Logistic Regression

The binary cross-entropy:

-\frac{1}{N} \sum \limits_{i = 1}^N \left[ y_i \log \theta_i + (1 - y_i) \log (1 - \theta_i) \right]

is a viable measure of loss for two-class problems!

  • A sort of measure of the probability we’re incorrect as a function of correct probabilities
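A direct implementation of the binary cross-entropy (the clipping constant `eps` is a hypothetical guard against log(0), not part of the definition):

```python
import numpy as np

def binary_cross_entropy(y, theta, eps=1e-12):
    """Average Bernoulli negative log likelihood given outcomes y and probabilities theta."""
    theta = np.clip(theta, eps, 1 - eps)   # guard against log(0)
    return -np.mean(y * np.log(theta) + (1 - y) * np.log(1 - theta))

y = np.array([1, 0, 1, 1])
theta = np.array([0.9, 0.1, 0.8, 0.6])
loss = binary_cross_entropy(y, theta)
assert loss > 0
# Confident, correct predictions drive the loss toward zero
assert binary_cross_entropy(y, np.array([0.99, 0.01, 0.99, 0.99])) < loss
```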

Binary Logistic Regression

Can we find an analytical solution for the MLE coefficients?

Let’s use the chain rule:

\frac{\partial - \ell \ell(\boldsymbol \beta | \mathbf X , \mathbf y)}{\partial \boldsymbol \theta} \frac{\partial \boldsymbol \theta}{\partial \mathbf z} \frac{\partial \mathbf z}{\partial \boldsymbol \beta}

A note:

Because our initial objective function is a sum, we can split this problem into a sum of derivatives:

\sum \limits_{i = 1}^N \frac{\partial - \ell \ell(\boldsymbol \beta | \mathbf x_i , y_i)}{\partial \theta_i} \frac{\partial \theta_i}{\partial z_i} \frac{\partial z_i}{\partial \boldsymbol \beta}

Binary Logistic Regression

Let’s start with the first term:

\frac{\partial - \ell \ell(\boldsymbol \beta | \mathbf x_i , y_i)}{\partial \theta_i} = - \left(\frac{y_i}{\theta_i} - \frac{1 - y_i}{1 - \theta_i} \right)

The second term:

\frac{d}{d z_i}\frac{\exp[z_i]}{1 + \exp[z_i]}

can be solved using the logarithmic derivative identity:

f(w) \frac{d}{dw} \log f(w) = \frac{d}{dw} f(w)

Binary Logistic Regression

Applying this:

\frac{\exp[z_i]}{1 + \exp[z_i]}\left[1 - \frac{\exp[z_i]}{1 + \exp[z_i]} \right] = \theta_i(1 - \theta_i)

The third term:

\frac{\partial \mathbf x_i^T \boldsymbol \beta}{\partial \boldsymbol \beta} = \mathbf x_i

Binary Logistic Regression

Putting this all together:

\frac{\partial - \ell \ell(\boldsymbol \beta | \mathbf X , \mathbf y)}{\partial \boldsymbol \beta} = \sum \limits_{i = 1}^N - \left(\frac{y_i}{\theta_i} - \frac{1 - y_i}{1 - \theta_i} \right) \theta_i(1 - \theta_i) \mathbf x_i

which can be simplified to:

\frac{\partial - \ell \ell(\boldsymbol \beta | \mathbf X , \mathbf y)}{\partial \boldsymbol \beta} = \sum \limits_{i = 1}^N - (y_i - \theta_i) \mathbf x_i

Plugging in our definition of \theta_i:

\frac{\partial - \ell \ell(\boldsymbol \beta | \mathbf X , \mathbf y)}{\partial \boldsymbol \beta} = - \sum \limits_{i = 1}^N \left( y_i - \frac{\exp[\mathbf x_i^T \boldsymbol \beta]}{1 + \exp[\mathbf x_i^T \boldsymbol \beta]} \right) \mathbf x_i

Set this equal to zero and solve for \boldsymbol \beta!

Binary Logistic Regression

That’s not going to work…

  • There isn’t really an analytical solution to this problem.

However, we have a gradient for the coefficients and can implement gradient descent algorithms to find the minimum!

  • More next class.
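As a preview, a minimal gradient descent sketch using the gradient derived above (synthetic data; the step size and iteration count are arbitrary illustrative choices, not tuned recommendations):

```python
import numpy as np

rng = np.random.default_rng(3)
X = np.column_stack([np.ones(200), rng.normal(size=200)])   # intercept + one feature
true_beta = np.array([-0.5, 2.0])
y = rng.binomial(1, 1 / (1 + np.exp(-(X @ true_beta))))

beta = np.zeros(2)
for _ in range(2000):
    p = 1 / (1 + np.exp(-(X @ beta)))      # theta_i for each instance
    grad = -X.T @ (y - p) / len(y)         # average gradient of the NLL derived above
    beta -= 1.0 * grad                     # fixed step size

p = 1 / (1 + np.exp(-(X @ beta)))
# The first-order condition is (approximately) satisfied at convergence
assert np.max(np.abs(X.T @ (y - p) / len(y))) < 1e-3
# The slope's sign and rough scale are recovered
assert beta[1] > 1.0
```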

Multinomial Logistic Regression

We can set multinomial logistic regression up in a similar way leveraging a categorical distribution

\log Pr(y_i | \boldsymbol \theta_i) = \sum \limits_{k = 1}^K I(y_i = k) \log \theta_{ik}

\theta_{ik} = \frac{\exp[z_{ik}]}{\sum \limits_{h = 1}^K \exp[z_{ih}]}

z_{ik} = \mathbf x_i^T \boldsymbol \beta_k

  • Like logistic regression, there is a unique (with an additional constraint) MLE

  • We just can’t find it analytically

  • The gradients for the coefficients are left for your 2nd homework

Linear Models

Linear models are the most common machine learning method because of their ease of implementation

  • Find a likelihood model for the outcome that corresponds to your loss function of choice

  • Use the likelihood to find a set of coefficients for the P features included in \phi(\mathbf x)

It’s really that simple!

They are also so popular because they are easy to regularize

Regularization

Regularization is a practice in convex optimization that converts an ill posed optimization problem to one that is “regular” and can be solved via standard optimization methods.

  • Take a gnarly discrete optimization problem and turn it into a continuous one

It has also found a home in the machine learning community as it can be used to force models fit via ERM to find solutions that generalize well (or at least better than the true ERM solution)

  • The continuous equivalent turns into a complexity knob

Regularization

We can define a regularized loss function as one that adds an additional penalty to the loss as the complexity of the model increases. With a negative log-likelihood, the penalized loss function becomes:

\text{NLL} + \lambda C(\boldsymbol \Theta)

where C(\cdot) is a function that increases as the model becomes more complex and \lambda is a tunable knob.

Regularization

For linear models with continuous outcomes, we can define the complexity of the model in terms of the projection matrix that maps the training features to the outcomes.

  • Remember that this matrix encodes the self-correlation between the training outcome and its prediction

  • When we overfit, we’re relying too heavily on the training data

\text{Generalization Error} = \text{Training Error} + \frac{2}{N} \sum Cov[y_i, \hat{y}_i]

\hat{y} = \mathbf S \mathbf y

Regularization

For OLS:

\mathbf S = \mathbf X (\mathbf X^T \mathbf X)^{-1} \mathbf X^T

\frac{2}{N}\sum Cov[y_i, \hat{y}_i] = \frac{2\sigma^2}{N}\text{tr}(\mathbf S) = \frac{2\sigma^2P}{N}
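A numerical check that the trace of the OLS projection matrix equals the number of features (synthetic data):

```python
import numpy as np

rng = np.random.default_rng(4)
N, P = 60, 5
X = rng.normal(size=(N, P))
S = X @ np.linalg.inv(X.T @ X) @ X.T   # the OLS projection ("hat") matrix

# The trace of a projection matrix equals its rank: the number of features P
assert np.isclose(np.trace(S), P)
```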

Regularization

The complexity of a linear model has something to do with the number of coefficients!

  • So, our loss function should try to correct for this difference by placing a penalty on the number of parameters

  • Account for the difference between the empirical risk and the true risk

Regularization

The problem, though, revolves around how to place the penalty.

Approach 1:

Choose M \le P features to keep in the model and select the subset with the lowest generalization error. Try all possible combinations or perform a greedy search

  • Stepwise selection or L0 penalization

Problems?

Regularization

Rather than directly choosing features to leave in or out, regularize the search to create a path of models between the empty model and the full model estimated via OLS.

The path over models is indexed by a proper norm of the coefficient vector

  • Smaller norms mean less complex models with fewer features

  • Larger norms mean more complex models with more features

Regularization

Two common norm choices:

  • The squared L2 norm (Ridge regression or weight decay)

\text{NLL} + \lambda \|\boldsymbol \beta\|_2^2

  • The L1 norm (the LASSO)

\text{NLL} + \lambda \|\boldsymbol \beta\|_1

The effect of implementing this norm penalty is that the norm of the ERM coefficients shrinks towards zero

Both choices try to replicate the gap between training error and generalization error!
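A brief sklearn sketch of this shrinkage behavior (penalty values are illustrative; sklearn calls its penalty weight alpha rather than \lambda):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 6))
y = X @ np.array([3.0, -2.0, 0.0, 0.0, 1.0, 0.0]) + rng.normal(size=100)

# Coefficient norms shrink toward zero as the penalty weight grows
l2_norms = [np.linalg.norm(Ridge(alpha=a).fit(X, y).coef_) for a in (0.01, 10.0, 1000.0)]
l1_norms = [np.linalg.norm(Lasso(alpha=a).fit(X, y).coef_, 1) for a in (0.01, 1.0, 100.0)]
assert l2_norms[0] > l2_norms[1] > l2_norms[2]
assert l1_norms[0] > l1_norms[1] > l1_norms[2]

# With a large enough L1 penalty, every coefficient is exactly zero
assert np.all(Lasso(alpha=100.0).fit(X, y).coef_ == 0.0)
```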

Regularization

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.404
Model:                            OLS   Adj. R-squared:                  0.365
Method:                 Least Squares   F-statistic:                     10.49
Date:                Thu, 23 Jan 2025   Prob (F-statistic):           7.26e-09
Time:                        13:12:25   Log-Likelihood:                -364.25
No. Observations:                 100   AIC:                             742.5
Df Residuals:                      93   BIC:                             760.7
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          3.0391      0.958      3.172      0.002       1.136       4.942
x1            -3.5738     12.524     -0.285      0.776     -28.443      21.296
x2            -3.5332      9.878     -0.358      0.721     -23.150      16.083
x3            21.2895     12.638      1.685      0.095      -3.807      46.386
x4             3.7228      9.049      0.411      0.682     -14.247      21.693
x5             0.8834      9.869      0.090      0.929     -18.715      20.481
x6           -14.4625      9.005     -1.606      0.112     -32.345       3.420
==============================================================================
Omnibus:                        1.960   Durbin-Watson:                   2.160
Prob(Omnibus):                  0.375   Jarque-Bera (JB):                1.502
Skew:                          -0.289   Prob(JB):                        0.472
Kurtosis:                       3.163   Cond. No.                         37.8
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Regularization

We see that when \lambda is small, we largely recover OLS

As \lambda increases, the ERM for both norm penalties shrinks

  • We’re adding more penalty for having large coefficients

Finally, when \lambda is large, there is no set of coefficients that yields a small enough MSE to offset the complexity penalty, so they are all set to zero.

Regularization

All in all, this strategy works quite well for yielding more generalizable linear models

  • Use cross-validation or other approaches to choose a value of \lambda given the penalty that minimizes generalization error

  • There always exists some non-zero \lambda that yields lower generalization error for L2 penalty and MSE

  • Most of the time, L1 will decrease generalization error.

Regularization

Why does regularization work?

  • We’re penalizing the complexity and this allows sparser coefficient vectors to minimize the empirical risk

But, the penalties seem kinda arbitrary

  • Why the L2 norm? Why the L1? Why does this work?
  • Good properties. Not the only way to do this.
  • L2 is nice because it is differentiable

Personally, I’ve found all non-Bayesian explanations to be quite underwhelming…

  • A discussion for another day

Next Time

How do we actually find the coefficients for a logistic regression model?

  • Gradients and Hessians

  • Gradient Descent and Stochastic Gradient Descent

  • 2nd order methods

  • Adaptive momentum methods